---
title: "Read Me First"
---

## Running Local Models with Cline

Local models have reached a turning point. For the first time, you can run Cline completely offline with genuinely capable models. No API costs, no data leaving your machine, no internet dependency.

The key is choosing the right model for your hardware and configuring it properly.

## What You Need to Know

### Hardware Requirements

Your RAM determines which models you can run:

| RAM Tier | Recommended Model | Quantization | What You Get |
| --- | --- | --- | --- |
| 32GB | Qwen3 Coder 30B | 4-bit | Entry-level local coding |
| 64GB | Qwen3 Coder 30B | 8-bit | Full Cline features |
| 128GB+ | GLM-4.5-Air | 4-bit | Cloud-competitive performance |

### The Model That Works: Qwen3 Coder 30B

After extensive testing, **Qwen3 Coder 30B** is the only model under 70B parameters that reliably works with Cline. It brings:

- 256K native context window
- Strong tool-use capabilities
- Repository-scale understanding
- Reliable command execution

Most smaller models (7B-20B) fail with Cline. They produce broken outputs, refuse to execute commands, or can't handle tool use properly.

### Critical Configuration

Getting local models to work requires specific settings:

**For LM Studio:**
1. Context Length: 262,144 (maximum)
2. KV Cache Quantization: OFF (critical)
3. Flash Attention: ON (if available)

**For All Local Models:**
- Enable "Use Compact Prompt" in Cline settings
- This reduces prompt size by 90% while maintaining core functionality
- Essential for local inference performance

### Quantization Explained

Quantization reduces model precision to fit on consumer hardware. Think of it as compression:

- **4-bit**: ~75% size reduction. Completely usable for coding tasks.
- **8-bit**: ~50% size reduction. Better quality, more nuanced responses.
- **16-bit**: Full precision. Matches cloud APIs but requires 4x the memory.

For Qwen3 Coder 30B:
- 4-bit: ~17GB download
- 8-bit: ~32GB download
- 16-bit: ~60GB download

### Model Format

Choose based on your platform:

**MLX (Mac only)**
- Optimized for Apple Silicon
- Leverages Metal and AMX acceleration
- Faster inference on M1/M2/M3 chips

**GGUF (Universal)**
- Works on Windows, Linux, and Mac
- Extensive quantization options
- Broader tool compatibility

## Performance Characteristics

Local models perform differently than cloud APIs:

**Expect:**
- Warmup time when first loading (normal, happens once)
- Slower inference than cloud models
- Context ingestion slows with very large repositories

**Don't Expect:**
- Instant responses like cloud APIs
- Unlimited context processing speed
- Zero configuration

## When Local Models Excel

Use local models for:

- Offline development where internet is unreliable
- Privacy-sensitive projects where code can't leave your environment
- Cost-conscious development where API usage would be prohibitive
- Learning and experimentation with unlimited usage

## When to Use Cloud Models

Cloud models still have advantages for:

- Very large repositories exceeding local context limits
- Multi-hour refactoring sessions needing maximum context
- Teams requiring consistent performance across different hardware
- Tasks requiring the absolute latest model capabilities

## Common Issues

**"Shell integration unavailable" or command execution fails**

Switch to a simpler shell in Cline settings. Go to Cline Settings → Terminal → Default Terminal Profile and select "bash". This resolves 90% of terminal integration problems.

**"No connection could be made"**

Your local server (Ollama or LM Studio) isn't running, or is running on a different port. Check that:
- The server is actually running
- The Base URL in Cline settings matches your server's address
- No firewall is blocking the connection

**Slow or incomplete responses**

This is normal for local models. They're significantly slower than cloud APIs. If it's too slow:
- Try a smaller quantization (4-bit instead of 8-bit)
- Reduce context window size
- Enable compact prompts if you haven't already

**Model seems confused or makes errors**

Ensure you have:
- Compact prompts enabled
- KV Cache Quantization disabled (LM Studio)
- Context length set to maximum
- Sufficient RAM for your chosen quantization

## Getting Started

1. **Choose your runtime**: [LM Studio](/running-models-locally/lm-studio) or [Ollama](/running-models-locally/ollama)
2. **Download Qwen3 Coder 30B** in the appropriate quantization for your RAM
3. **Configure critical settings** as outlined above
4. **Enable compact prompts** in Cline settings
5. **Start coding** offline

## The Reality of Local Models

Local models are now genuinely useful for coding tasks, but they're not magic. You're trading some convenience and speed for privacy and cost savings. The setup requires attention to detail, and performance won't match top-tier cloud APIs.

But for the first time, you can run a capable coding agent entirely on your laptop. That's a significant milestone.

## Need Help?

- Join our [Discord](https://discord.gg/cline) community
- Visit [r/cline](https://www.reddit.com/r/CLine/) on Reddit
- Check the [LM Studio guide](/running-models-locally/lm-studio) for detailed setup
- See the [Ollama guide](/running-models-locally/ollama) for alternative setup
